Jeremy Lew
August 20, 2024
flowchart LR D[Data exported\nfrom REDCAP]-->C[Data cleaning] C-->S[Statistical analysis] S-->P[Presentation\nof results]
- Install a package with install.packages("palmerpenguins") in the console
- Load it with library(palmerpenguins)
- Remove it with remove.packages("palmerpenguins") in the console

Console: where you run/execute lines of code
To assign a variable, we use the <- operator
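For example (toy values):

```r
# Assign 5 to x with the <- operator, then use x in a new expression
x <- 5
y <- x * 2
y  # 10
```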
Environment: where you’re able to see variables stored in random-access memory
- Clear your environment with rm(list = ls()) in the console
- We will use the palmerpenguins and tidyverse packages. Install them if you have not already done so
- We will use the penguins dataset from the palmerpenguins package
- We will convert body_mass_g (continuous) into a categorical variable with 2 categories ("light": <= 4750g and "heavy": > 4750g)
- We will run regression models with body_mass_g as the dependent variable and independent variables: species, bill_length_mm, bill_depth_mm, flipper_length_mm and sex
- We will use the factor datatype for working with categorical variables
A factor variable is more useful than a plain character vector. E.g. the first level will be used by glm as the reference category
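A small illustration with a hypothetical status vector, showing how the first level acts as the reference category:

```r
# Character vector of smoking status (hypothetical example data)
status <- c("smoker", "non-smoker", "smoker")

# factor() orders levels alphabetically by default: "non-smoker" comes first
f1 <- factor(status)
levels(f1)  # "non-smoker" "smoker"

# Specify levels explicitly so "smoker" becomes the reference category
f2 <- factor(status, levels = c("smoker", "non-smoker"))
levels(f2)  # "smoker" "non-smoker"
```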
Data structures are containers for holding data
A character vector holds multiple characters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
A numeric vector holds multiple numbers
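A minimal sketch of both vector types:

```r
# Character vector: every element is a string
chars <- c("a", "b", "c")
class(chars)  # "character"

# Numeric vector: every element is a number
nums <- c(1.5, 2, 3.7)
class(nums)   # "numeric"
sum(nums)     # 7.2
```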
Exercise
- length(letters)
- letters[1:5]
- letters[c(1, 9, 20)]
- letters[length(letters):1]

Unnamed lists
Named lists
Exercise
- mylist <- list(a = "a", b = 1, c = 10:15)
- length(mylist)
- mylist[1]
- mylist$a or mylist[["a"]]
- mylist[3] and mylist[[3]]
- class(mylist[3]) and class(mylist[[3]])

A dataframe is a tabular container that, like a list, can hold multiple data types, but each column can only store data of a single data type
To view the first 2 rows of the penguins dataset from the palmerpenguins package
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
# ℹ 2 more variables: sex <fct>, year <int>
To view the last 2 rows of the dataset
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 50.8 19 210 4100
2 Chinstrap Dream 50.2 18.7 198 3775
# ℹ 2 more variables: sex <fct>, year <int>
To view the data in RStudio, execute View(penguins)
Exercise
- Explore the dimensions of the penguins dataset using dim(penguins), ncol(penguins), nrow(penguins)
- What does glimpse(penguins) show?
- What does summary(penguins) show?

To get the column/variable names of your dataset
To access any column variable, we use the $ syntax
To get a table count of a variable or a cross-table count of 2 variables
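A sketch of both using the penguins dataset:

```r
library(palmerpenguins)

# Access a column with the $ syntax
head(penguins$species)

# One-way count of a variable
table(penguins$species)

# Cross-table count of 2 variables
table(penguins$species, penguins$island)
```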
See Data Cleaning for more ways to work with dataframes
- read_csv for csv files
- read_excel for Excel files
- haven package for SPSS/Stata/SAS files
- labelled package for working with variable labels

A good chapter explaining functions is Chapter 6 of Advanced R, from which most of the examples in these slides are taken
Example function
Components of a function
Functions are objects that can be assigned or they can be anonymous functions
How do we invoke the function?
How do we invoke one function after another?
Suppose we have another function f02
To call f01 first followed by f02 and sqrt we will “nest” the functions (inside-out, right to left)
Another way to invoke a series of functions is to use the pipe operator %>% provided by the magrittr package
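A sketch contrasting the two styles, with toy bodies for f01 and f02 (their actual definitions are assumptions here):

```r
library(magrittr)  # provides the %>% pipe operator

f01 <- function(x) x + 1   # toy body
f02 <- function(x) x * 2   # toy body

# Nested call: evaluated inside-out, right to left
sqrt(f02(f01(7)))          # (7 + 1) * 2 = 16, sqrt(16) = 4

# Pipe: the same computation, but it reads left to right
7 %>% f01() %>% f02() %>% sqrt()  # 4
```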
Exercise
- What is the difference between f02 and f02(1, 2)?

What is returned by a function? The last evaluated expression
How do we store values output from a function? With an assignment statement
Early termination of a function with a return statement
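A minimal sketch (safe_log is a hypothetical helper, not from the slides):

```r
# return() terminates the function early; otherwise the last
# evaluated expression is the return value
safe_log <- function(x) {
  if (x <= 0) {
    return(NA)  # early termination for non-positive input
  }
  log(x)
}

safe_log(-1)  # NA
safe_log(1)   # 0
```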
Pass by copy vs pass by reference
In R, arguments are effectively passed by reference as long as the function does not modify the object; a copy is made only when the object is modified inside the function, i.e. copy-on-modify
How to pass arguments?
By position e.g. f04(10, 50, 20)
By name e.g. f04(z = 20, y = 50, x = 10)
Unpacking multiple named arguments using do.call
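A sketch with a toy body for f04 (only its argument names x, y, z come from the slides):

```r
f04 <- function(x, y, z) x + y - z  # toy body

# do.call unpacks a named list into the function's arguments
args <- list(x = 10, y = 50, z = 20)
do.call(f04, args)  # same as f04(x = 10, y = 50, z = 20), i.e. 40
```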
Names defined inside a function mask names defined outside a function
R will look up a variable in the enclosing scope (e.g. within the function) and if it’s not found, will continue to proceed upwards (e.g. enclosing function or global environment) to look for the variable until it is found
R looks for the values when the function is run, not when the function is created
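The scoping rules above can be sketched as:

```r
x <- 10  # defined in the global environment

f <- function() {
  # x is not defined inside f, so R searches the enclosing
  # environments at the time f is called
  x
}

f()      # 10
x <- 99
f()      # 99 -- the new value is found because lookup happens at call time
```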
The select function enables you to select a subset of the columns in your dataset
The filter function enables you to select a subset of the rows that meet a certain criteria
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 39.3 20.6 190 3650
# ℹ 2 more variables: sex <fct>, year <int>
# A tibble: 165 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.5 17.4 186 3800
2 Adelie Torgersen 40.3 18 195 3250
3 Adelie Torgersen 36.7 19.3 193 3450
4 Adelie Torgersen 38.9 17.8 181 3625
5 Adelie Torgersen 41.1 17.6 182 3200
6 Adelie Torgersen 36.6 17.8 185 3700
7 Adelie Torgersen 38.7 19 195 3450
8 Adelie Torgersen 34.4 18.4 184 3325
9 Adelie Biscoe 37.8 18.3 174 3400
10 Adelie Biscoe 35.9 19.2 189 3800
# ℹ 155 more rows
# ℹ 2 more variables: sex <fct>, year <int>
The mutate function enables you to add columns to your dataset. The added columns can be derived from existing column(s)
# A tibble: 5 × 2
body_mass_g body_mass_100g
<int> <dbl>
1 3750 37.5
2 3800 38
3 3250 32.5
4 NA NA
5 3450 34.5
The summarise function enables you to get summary statistics like N, mean, median etc
# A tibble: 1 × 2
N mean_body_mass_g
<int> <dbl>
1 344 NA
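The NA mean above arises because body_mass_g contains missing values; a sketch with na.rm = TRUE to drop them:

```r
library(dplyr)
library(palmerpenguins)

# mean() returns NA if any value is NA, unless na.rm = TRUE
penguins %>%
  summarise(N = n(),
            mean_body_mass_g = mean(body_mass_g, na.rm = TRUE))
```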
The group_by function enables you to get summary statistics within groups
# A tibble: 3 × 4
sex N mean_body_mass_g median_body_mass_g
<fct> <int> <dbl> <dbl>
1 female 165 3862. 3650
2 male 168 4546. 4300
3 <NA> 11 NA NA
The function arrange enables us to sort by a certain variable
# A tibble: 9 × 3
# Groups: sex [3]
sex year mean_body_mass_g
<fct> <int> <dbl>
1 female 2009 3874.
2 male 2009 4521.
3 <NA> 2009 NA
4 female 2008 3888.
5 male 2008 4632.
6 <NA> 2008 4650
7 female 2007 3821.
8 male 2007 4479.
9 <NA> 2007 NA
The functions if_else and case_when are often used with mutate to create new variables
# A tibble: 4 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Chinstrap Dream 43.5 18.1 202 3400
2 Gentoo Biscoe 43.6 13.9 217 4900
3 Gentoo Biscoe 48.1 15.1 209 5500
4 Gentoo Biscoe 46.8 14.3 215 4850
# ℹ 3 more variables: sex <fct>, year <int>, bill_length_type <chr>
If we convert a character vector to a factor vector, the levels will be arranged in alphabetical order automatically if unspecified
We can specify the levels of the factor using the levels argument
penguins %>%
mutate(bill_length_type = case_when(
bill_length_mm < 40 ~ "short bill",
between(bill_length_mm, 40, 50) ~ "medium bill",
bill_length_mm > 50 ~ "long bill",
TRUE ~ NA_character_) %>%
factor(levels = c("short bill", "medium bill", "long bill"))
) %>%
  .$bill_length_type %>% levels()

[1] "short bill"  "medium bill" "long bill"
The factor vector will only contain values specified by levels. Any existing character values unspecified in levels (e.g. “long bill”) will be converted to NA
penguins %>%
mutate(bill_length_type = case_when(
bill_length_mm < 40 ~ "short bill",
between(bill_length_mm, 40, 50) ~ "medium bill",
bill_length_mm > 50 ~ "long bill",
TRUE ~ NA_character_) %>%
factor(levels = c("short bill", "medium bill"))
) %>%
  .$bill_length_type %>% levels()

[1] "short bill"  "medium bill"
fct_relevel relevels the levels of a factor vector (if the input is not already a factor, e.g. a character vector, it will first be converted to a factor). Any unspecified levels follow after the specified ones, left in their existing order
penguins %>%
mutate(bill_length_type = case_when(
bill_length_mm < 40 ~ "short bill",
between(bill_length_mm, 40, 50) ~ "medium bill",
bill_length_mm > 50 ~ "long bill",
TRUE ~ NA_character_) %>%
fct_relevel("short bill", "medium bill")
) %>%
  .$bill_length_type %>% levels()

[1] "short bill"  "medium bill" "long bill"
We can collapse levels of a factor variable using forcats' fct_collapse function
The island variable has 3 levels
Suppose we want to collapse island into 2 levels, we can apply fct_collapse like so
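A sketch of one such collapse (the particular grouping below is illustrative, not from the slides):

```r
library(forcats)
library(palmerpenguins)

# Collapse Dream and Torgersen into a single "Other" level
island2 <- fct_collapse(penguins$island,
                        Biscoe = "Biscoe",
                        Other  = c("Dream", "Torgersen"))
nlevels(island2)  # 2
```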
Suppose we have some data in long format
# Generate fake patient data for illustration
generate_fake_patient_data <- function(id) {
data.frame(list(
patient_id = id,
time = c("baseline", "12mth"),
hba1c = runif(n = 2, min = 4, max = 14),
ldl = sample(c("good control", "bad control"))
))
}
set.seed(2024)
data <- map(
sprintf("P%s", str_pad(1:344, width = 2, pad = 0)),
generate_fake_patient_data
) %>%
bind_rows()
print(head(data, 10))

   patient_id     time     hba1c          ldl
1 P01 baseline 12.369425 good control
2 P01 12mth 7.208675 bad control
3 P02 baseline 8.570092 good control
4 P02 12mth 11.014203 bad control
5 P03 baseline 12.765881 bad control
6 P03 12mth 5.190482 good control
7 P04 baseline 11.035531 bad control
8 P04 12mth 9.157855 good control
9 P05 baseline 5.103111 bad control
10 P05 12mth 12.632152 good control
We can make use of tidyr’s pivot_wider to reshape our data from long to wide
data_wide <- data %>%
pivot_wider(id_cols = patient_id,
names_from = "time",
names_glue = "{.value}_{time}",
values_from = c(hba1c, ldl))
head(data_wide, 15)

# A tibble: 15 × 5
patient_id hba1c_baseline hba1c_12mth ldl_baseline ldl_12mth
<chr> <dbl> <dbl> <chr> <chr>
1 P01 12.4 7.21 good control bad control
2 P02 8.57 11.0 good control bad control
3 P03 12.8 5.19 bad control good control
4 P04 11.0 9.16 bad control good control
5 P05 5.10 12.6 bad control good control
6 P06 4.47 5.12 bad control good control
7 P07 12.0 6.92 good control bad control
8 P08 4.01 4.21 bad control good control
9 P09 5.76 13.7 bad control good control
10 P10 4.54 13.7 good control bad control
11 P11 12.7 13.1 good control bad control
12 P12 12.5 8.48 bad control good control
13 P13 10.5 10.5 bad control good control
14 P14 8.53 8.89 bad control good control
15 P15 4.20 5.92 good control bad control
Suppose instead that we were given data in a wide format
# A tibble: 10 × 5
patient_id hba1c_baseline hba1c_12mth ldl_baseline ldl_12mth
<chr> <dbl> <dbl> <chr> <chr>
1 P01 12.4 7.21 good control bad control
2 P02 8.57 11.0 good control bad control
3 P03 12.8 5.19 bad control good control
4 P04 11.0 9.16 bad control good control
5 P05 5.10 12.6 bad control good control
6 P06 4.47 5.12 bad control good control
7 P07 12.0 6.92 good control bad control
8 P08 4.01 4.21 bad control good control
9 P09 5.76 13.7 bad control good control
10 P10 4.54 13.7 good control bad control
We can make use of tidyr’s pivot_longer to reshape our data into long format
data_wide %>%
pivot_longer(
cols = matches("baseline|12mth"),
names_pattern = "(.*)_(baseline|12mth)",
names_to = c(".value", "time")
  )

# A tibble: 688 × 4
patient_id time hba1c ldl
<chr> <chr> <dbl> <chr>
1 P01 baseline 12.4 good control
2 P01 12mth 7.21 bad control
3 P02 baseline 8.57 good control
4 P02 12mth 11.0 bad control
5 P03 baseline 12.8 bad control
6 P03 12mth 5.19 good control
7 P04 baseline 11.0 bad control
8 P04 12mth 9.16 good control
9 P05 baseline 5.10 bad control
10 P05 12mth 12.6 good control
# ℹ 678 more rows
This combination of mutate & across enables us to apply a function to multiple columns of a dataframe
penguins %>%
mutate(across(.cols = c(bill_length_mm, bill_depth_mm),
.fns = ~ floor(.x),
.names = "{.col}_floored")) %>%
  select(matches("bill_(length|depth)"))

# A tibble: 344 × 4
bill_length_mm bill_depth_mm bill_length_mm_floored bill_depth_mm_floored
<dbl> <dbl> <dbl> <dbl>
1 39.1 18.7 39 18
2 39.5 17.4 39 17
3 40.3 18 40 18
4 NA NA NA NA
5 36.7 19.3 36 19
6 39.3 20.6 39 20
7 38.9 17.8 38 17
8 39.2 19.6 39 19
9 34.1 18.1 34 18
10 42 20.2 42 20
# ℹ 334 more rows
If we omit the .names argument, the function will be applied in-place
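A sketch of the in-place behaviour:

```r
library(dplyr)
library(palmerpenguins)

# Without .names, the floored values replace the original columns
floored <- penguins %>%
  mutate(across(c(bill_length_mm, bill_depth_mm), floor))

head(floored$bill_length_mm, 3)  # 39 39 40
```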
We can make use of regular expressions in stringr’s str_detect function together with filter, or if_else/case_when for data cleaning
Here we create a species_plus variable
set.seed(2024)
penguins <- penguins %>%
mutate(across(
.cols = "species",
.fns = ~ paste(.x,
sample(c("baby", "teenager", "adult"), nrow(penguins), replace = TRUE),
sex,
sep = "_"
),
.names = "{.col}_plus"
))
# view counts of species_plus
table(penguins$species_plus)
Adelie_adult_female Adelie_adult_male Adelie_adult_NA
22 17 3
Adelie_baby_female Adelie_baby_male Adelie_baby_NA
27 33 1
Adelie_teenager_female Adelie_teenager_male Adelie_teenager_NA
24 23 2
Chinstrap_adult_female Chinstrap_adult_male Chinstrap_baby_female
14 11 7
Chinstrap_baby_male Chinstrap_teenager_female Chinstrap_teenager_male
11 13 12
Gentoo_adult_female Gentoo_adult_male Gentoo_adult_NA
25 20 2
Gentoo_baby_female Gentoo_baby_male Gentoo_baby_NA
21 22 1
Gentoo_teenager_female Gentoo_teenager_male Gentoo_teenager_NA
12 19 2
We can use a regular expression to identify categories
penguins %>%
mutate(across(
.cols = "species_plus",
.fns = ~ case_when(
# Chinstrap/Gentoo, not followed by baby, male/female
str_detect(.x, "((Chinstrap|Gentoo)(?!_baby)).*(male|female)") ~ "DO SOMETHING",
TRUE ~ .x # maintain existing value
),
.names = "{.col}2"
)) %>%
.$species_plus2 %>%
  table()
Adelie_adult_female Adelie_adult_male Adelie_adult_NA
22 17 3
Adelie_baby_female Adelie_baby_male Adelie_baby_NA
27 33 1
Adelie_teenager_female Adelie_teenager_male Adelie_teenager_NA
24 23 2
Chinstrap_baby_female Chinstrap_baby_male DO SOMETHING
7 11 126
Gentoo_adult_NA Gentoo_baby_female Gentoo_baby_male
2 21 22
Gentoo_baby_NA Gentoo_teenager_NA
1 2
The ggplot2 package is well known for plotting figures
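A minimal sketch of a ggplot2 figure on the penguins data (the variable choices are illustrative):

```r
library(ggplot2)
library(palmerpenguins)

# Scatter plot of flipper length vs body mass, coloured by species
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g,
                          colour = species)) +
  geom_point()
p
```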
| Variable 1 | Variable 2 | Bivariate test |
|---|---|---|
| Categorical | Categorical | 1. Chi-square test 2. Fisher’s exact test |
| Categorical | Continuous | Parametric: 1. Independent t-test (2 categories) 2. One-way independent ANOVA (>2 categories) |
| Categorical | Continuous | Non-parametric: 1. Mann-Whitney test (2 categories) a.k.a. Wilcoxon rank-sum test 2. Kruskal-Wallis test (>2 categories) |
| Continuous | Continuous | Parametric: Pearson’s correlation coefficient |
| Continuous | Continuous | Non-parametric: Spearman’s correlation coefficient |
| Variable 1 | Variable 2 | Bivariate test |
|---|---|---|
| Categorical | Categorical time (e.g. baseline, 12-month) | McNemar’s test |
| Continuous | Categorical time (e.g. baseline, 12-month) | Parametric: Dependent t-test |
| Continuous | Categorical time (e.g. baseline, 12-month) | Non-parametric: Wilcoxon signed-rank test |
The tableby function of the arsenal package lets you do this easily

library(arsenal)
tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"))) %>%
summary(text = TRUE) %>%
  knitr::kable(align = c("l", rep("r", 5)))

The tableby function is invoked by the first 3 lines
| | Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value |
|---|---|---|---|---|---|
| sex | | | | | 0.976 |
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| island | | | | | < 0.001 |
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| bill_length_mm | | | | | < 0.001 |
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.791 (2.663) | 48.834 (3.339) | 47.505 (3.082) | 43.922 (5.460) | |
| - Median (Q1, Q3) | 38.800 (36.750, 40.750) | 49.550 (46.350, 51.075) | 47.300 (45.300, 49.550) | 44.450 (39.225, 48.500) | |
| - Range | 32.100 - 46.000 | 40.900 - 58.000 | 40.900 - 59.600 | 32.100 - 59.600 |
Suppose we have repeated measures of HbA1c and LDL from patients at e.g. baseline and 12-month
generate_fake_patient_data <- function(id) {
data.frame(list(
patient_id = id,
time = c("baseline", "12-month"),
hba1c = runif(n = 2, min = 4, max = 14),
ldl = sample(c("good control", "bad control"))
))
}
set.seed(2024)
data <- map(
sprintf("P%s", str_pad(1:30, width = 2, pad = 0)),
generate_fake_patient_data
) %>%
bind_rows() %>%
mutate(across(.cols = time,
.fns = ~ fct_relevel(.x, "baseline", "12-month")))
print(head(data, 10)) patient_id time hba1c ldl
1 P01 baseline 12.369425 good control
2 P01 12-month 7.208675 bad control
3 P02 baseline 8.570092 good control
4 P02 12-month 11.014203 bad control
5 P03 baseline 12.765881 bad control
6 P03 12-month 5.190482 good control
7 P04 baseline 11.035531 bad control
8 P04 12-month 9.157855 good control
9 P05 baseline 5.103111 bad control
10 P05 12-month 12.632152 good control
The paired function from the arsenal package helps us tabulate pre-post tests easily
# pacman::p_load(arsenal, kableExtra)
table <- paired(
time ~ signed.rank(hba1c) + ldl,
data = data,
id = patient_id,
control = paired.control(digits = 2,
numeric.stats = c("Nmiss2", "meansd", "medianq1q3", "range"),
numeric.test = "signed.rank",
signed.rank.correct = FALSE))
table %>%
summary(text = TRUE) %>%
kable(align = c("l", rep("r", 4))) %>%
column_spec(column = 1, width_min = "4cm") %>%
column_spec(column = 2:4, width_min = "3cm") %>%
  column_spec(column = 5, width_min = "2.5cm")

| | baseline (N=30) | 12-month (N=30) | Difference (N=30) | p value |
|---|---|---|---|---|
| hba1c | | | | 0.730 |
| - N-Miss | 0 | 0 | 0 | |
| - Mean (SD) | 8.45 (3.09) | 8.76 (3.09) | 0.31 (4.78) | |
| - Median (Q1, Q3) | 7.81 (5.80, 11.30) | 8.66 (6.05, 11.78) | 0.29 (-2.44, 3.74) | |
| - Range | 4.01 - 12.77 | 4.21 - 13.73 | -7.58 - 9.12 | |
| ldl | | | | 0.855 |
| - bad control | 14 (46.7%) | 16 (53.3%) | 14 (100.0%) | |
| - good control | 16 (53.3%) | 14 (46.7%) | 16 (100.0%) |
# A tibble: 5 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Dream 35.6 17.5 191 3175
2 Chinstrap Dream 50.7 19.7 203 4050
3 Adelie Torgersen 39.7 18.4 190 3900
4 Gentoo Biscoe 43.2 14.5 208 4450
5 Gentoo Biscoe 47.2 13.7 214 4925
# ℹ 3 more variables: sex <fct>, year <int>, species_plus <chr>
flowchart LR A[Model specification] --> B[Inference] --> C[Diagnostics]
GLM consists of 2 components
Component 1: What distribution does your outcome, \(y_i\) take on?
\[\begin{align} y_i|X_i \sim \text{member of exponential family} \ \text{(e.g. $y_i$ is normally distributed)} \end{align}\]
Component 2: What is the link function \(g\) you want to use?
\[\begin{align} g(\mathbb{E}(y_i|X_i)) = X_i^{T} \beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]
In a generalised linear model, the outcome variable \(y_1,...,y_n\) are modelled as independent and identically distributed (iid) observations
The outcome variable is treated as a random variable (i.e. outcome takes on a certain distribution e.g. normal), but NOT the independent variables
When our outcome, \(y_i\) is continuous and takes on real values (i.e. \(y_i \in \mathbb{R}\)), we may choose to model \(y_i\) with a normal distribution
Component 1: \(y_i|x_i\) is normally distributed
\[ y_i | x_i \sim \mathcal{N}(X_i^{T}\beta, \sigma^2) \]
Component 2: link function is the identity function i.e. no transformation done
\[ \mathbb{E}(y_i|x_i) = X_i^{T}\beta = \beta_0 + \beta_1age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \]
Notice the use of the formula syntax. LHS of the ~ is the dependent variable, while RHS are the independent variables
Notice the family argument contains the 2 components of the GLM we discussed earlier
model <- glm(body_mass_g ~ species + bill_length_mm +
bill_depth_mm +
flipper_length_mm + sex,
data = penguins,
family = gaussian(link = "identity"))
summary(model)
Call:
glm(formula = body_mass_g ~ species + bill_length_mm + bill_depth_mm +
flipper_length_mm + sex, family = gaussian(link = "identity"),
data = penguins)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1460.995 571.308 -2.557 0.011002 *
speciesChinstrap -251.477 81.079 -3.102 0.002093 **
speciesGentoo 1014.627 129.561 7.831 6.85e-14 ***
bill_length_mm 18.204 7.106 2.562 0.010864 *
bill_depth_mm 67.218 19.742 3.405 0.000745 ***
flipper_length_mm 15.950 2.910 5.482 8.44e-08 ***
sexmale 389.892 47.848 8.148 7.97e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 82563.33)
Null deviance: 215259666 on 332 degrees of freedom
Residual deviance: 26915647 on 326 degrees of freedom
(11 observations deleted due to missingness)
AIC: 4723.9
Number of Fisher Scoring iterations: 2
We can call coef and summary to access the beta coefficients and p-values
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1460.99463 571.308144 -2.557280 1.100170e-02
speciesChinstrap -251.47669 81.078824 -3.101632 2.092677e-03
speciesGentoo 1014.62666 129.560586 7.831291 6.852265e-14
bill_length_mm 18.20443 7.106258 2.561746 1.086405e-02
bill_depth_mm 67.21763 19.741850 3.404829 7.447574e-04
flipper_length_mm 15.95025 2.909612 5.481915 8.440786e-08
sexmale 389.89153 47.848346 8.148485 7.971622e-15
We can call confint.default to get the confidence intervals
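A sketch pulling these out, refitting the model from the earlier slide:

```r
library(palmerpenguins)

# Gaussian GLM from the earlier slide
model <- glm(body_mass_g ~ species + bill_length_mm + bill_depth_mm +
               flipper_length_mm + sex,
             data = penguins,
             family = gaussian(link = "identity"))

coef(model)                                 # beta coefficients
summary(model)$coefficients[, "Pr(>|t|)"]   # p-values
confint.default(model)                      # Wald 95% confidence intervals
```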
We can use the tabulate_glm_result function in the ezyr package to tabulate our results

# library(ezyr)
model %>%
tabulate_glm_result() %>%
mutate(across(.cols = "variable_name",
.fns = ~ case_when(var0 == "header" ~ sprintf("**%s**", .x),
TRUE ~ sprintf("- %s", .x)))) %>%
mutate(across(.cols = "Beta (95% CI)",
.fns = ~ case_when(var0 == "Ref" ~ "Ref",
TRUE ~ .x))) %>%
# bold p-values if < 0.05
mutate(across(.col = "p_value_",
.fns = ~ case_when(p_value < 0.05 ~ sprintf("**%s**", .x),
TRUE ~ .x))) %>%
select(variable_name, "Beta (95% CI)", p_value_) %>%
rename("Independent variables" = variable_name,
"p-value" = p_value_) %>%
kable(align = c("l", rep("r", 2))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
  fixed_thead = T, full_width = T)

| Independent variables | Beta (95% CI) | p-value |
|---|---|---|
| **species** | NA | NA |
| - Adelie | Ref | NA |
| - Chinstrap | -251.48 (-410.39 to -92.57) | **0.002** |
| - Gentoo | 1014.63 (760.69 to 1268.56) | **<0.001** |
| **bill_length_mm** | 18.20 (4.28 to 32.13) | **0.01** |
| **bill_depth_mm** | 67.22 (28.52 to 105.91) | **<0.001** |
| **flipper_length_mm** | 15.95 (10.25 to 21.65) | **<0.001** |
| **sex** | NA | NA |
| - female | Ref | NA |
| - male | 389.89 (296.11 to 483.67) | **<0.001** |
When our outcome, \(y_i\), is discrete and dichotomous (i.e. takes on two discrete values, 0-1, success-failure), we can model \(y_i\) with a Bernoulli distribution
Component 1: \(y_i|x_i\) is Bernoulli distributed
\[ y_i | x_i \sim \mathrm{Bernoulli}(\pi_i) \]
Component 2: Link function, \(g\), is the logit function
\[\begin{align} g(\mathbb{E}(y_i|x_i)) &= \underbrace{\mathrm{ln} \left( \frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)} \right)}_{logit[\mathbb{E}(y_i|x_i)]} = X_i^{T}\beta = \underbrace{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}_{\text{this linear combination is continuous}} \end{align}\]
Suppose we classify penguins into light & heavy based on body mass
Run logistic regression model with body_mass_binary as the dependent variable
Based on the factor levels, light: 0 and heavy: 1
# Classify penguins into "light" & "heavy"
penguins <- penguins %>%
mutate(body_mass_binary = if_else(body_mass_g < 4750, "light", "heavy") %>%
fct_relevel("light", "heavy"))
# Run logistic regression model
log_regression_model <- glm(body_mass_binary ~ species + bill_length_mm +
bill_depth_mm + flipper_length_mm + sex,
data = penguins,
family = binomial(link = "logit"))
summary(log_regression_model)
Call:
glm(formula = body_mass_binary ~ species + bill_length_mm + bill_depth_mm +
flipper_length_mm + sex, family = binomial(link = "logit"),
data = penguins)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -82.16706 1781.82645 -0.046 0.96322
speciesChinstrap -2.78221 2.09273 -1.329 0.18369
speciesGentoo 20.56985 1781.74802 0.012 0.99079
bill_length_mm 0.19322 0.14724 1.312 0.18944
bill_depth_mm 0.53003 0.48733 1.088 0.27676
flipper_length_mm 0.21164 0.07272 2.910 0.00361 **
sexmale 18.05158 1781.74594 0.010 0.99192
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 386.628 on 332 degrees of freedom
Residual deviance: 82.135 on 326 degrees of freedom
(11 observations deleted due to missingness)
AIC: 96.135
Number of Fisher Scoring iterations: 20
We can call coef and summary as before but to get odds ratios we must exponentiate the beta coefficients using exp
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.066808e-36 Inf 0.9549331 2.620118
speciesChinstrap 6.190135e-02 8.106997 0.2646181 1.201648
speciesGentoo 8.577706e+08 Inf 1.0116117 2.693358
bill_length_mm 1.213144e+00 1.158635 3.7144331 1.208577
bill_depth_mm 1.698985e+00 1.627963 2.9672136 1.318852
flipper_length_mm 1.235701e+00 1.075431 18.3614232 1.003618
sexmale 6.913539e+07 Inf 1.0101829 2.696397
We must likewise exponentiate the confidence intervals of the beta coefficients to get the confidence intervals of the odds ratios
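A sketch of both exponentiations, refitting the logistic model from the earlier slide:

```r
library(dplyr)
library(forcats)
library(palmerpenguins)

# Classify penguins into "light" & "heavy", as on the earlier slide
penguins2 <- penguins %>%
  mutate(body_mass_binary = if_else(body_mass_g < 4750, "light", "heavy") %>%
           fct_relevel("light", "heavy"))

log_regression_model <- glm(body_mass_binary ~ species + bill_length_mm +
                              bill_depth_mm + flipper_length_mm + sex,
                            data = penguins2,
                            family = binomial(link = "logit"))

exp(coef(log_regression_model))             # odds ratios
exp(confint.default(log_regression_model))  # 95% CIs of the odds ratios
```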
We can use the tabulate_glm_result function in the ezyr package to tabulate our results

We set exponentiate = TRUE as we need to exponentiate the beta coefficients to get odds ratios

log_regression_model %>%
tabulate_glm_result(exponentiate = TRUE) %>%
mutate(across(.cols = "variable_name",
.fns = ~case_when(var0 == "header" ~ sprintf("**%s**", .x),
TRUE ~ sprintf("- %s", .x)))) %>%
mutate(across(.cols = "exp(Beta) (exp(95% CI))",
.fns = ~case_when(var0 == "Ref" ~ "Ref",
TRUE ~ .x))) %>%
mutate(across(.col = "p_value_",
.fns = ~case_when(p_value < 0.05 ~ sprintf("**%s**", .x),
TRUE ~ .x))) %>%
select(variable_name, "exp(Beta) (exp(95% CI))", p_value_) %>%
rename("Independent variables" = variable_name,
"Odds ratio (95% CI)" = "exp(Beta) (exp(95% CI))",
"p-value" = p_value_) %>%
kable(align = c("l", rep("r", 2))) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
  fixed_thead = T, full_width = T)

| Independent variables | Odds ratio (95% CI) | p-value |
|---|---|---|
| **species** | NA | NA |
| - Adelie | Ref | NA |
| - Chinstrap | 0.06 (0.00 to 3.74) | 0.18 |
| - Gentoo | 857770649.49 (0.00 to Inf) | 0.99 |
| **bill_length_mm** | 1.21 (0.91 to 1.62) | 0.19 |
| **bill_depth_mm** | 1.70 (0.65 to 4.42) | 0.28 |
| **flipper_length_mm** | 1.24 (1.07 to 1.42) | **0.004** |
| **sex** | NA | NA |
| - female | Ref | NA |
| - male | 69135385.69 (0.00 to Inf) | 0.99 |
Maximum likelihood estimation (MLE) is the method used to infer the beta coefficients
The idea behind MLE is that we want to maximise the joint probability of our observed data assuming that the data was generated according to our model i.e. we want to maximise the likelihood function \(\mathcal{L}(\beta) = P(y_1, y_2, ..., y_n)\).
In the case of logistic regression, the likelihood function \(\mathcal{L}(\beta)\) is: \[\begin{align} \mathcal{L}(\beta) &= P(y_1, y_2, ..., y_n) \\ &= \prod_{i=1}^{n} P(y_i) \\ &= \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ \end{align}\]
Convert the likelihood to the log-likelihood to make it easier to work with (we can do that because log is a monotonic function) \[\begin{align} \mathrm{ln}\ \mathcal{L}(\beta) &= \mathrm{ln} \prod_{i=1}^{n} \pi_i^{y_i} (1-\pi_i)^{1-y_i} \\ &= \sum_{i=1}^{n} \left[ y_i\mathrm{ln}\pi_i + (1-y_i)\mathrm{ln}(1-\pi_i) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \mathrm{ln} \left( \frac{e^{\beta^{T} x_i}}{1+e^{\beta^{T} x_i}} \right) + (1-y_i) \mathrm{ln} \left( 1-\frac{e^{\beta^{T} x_i}}{1+e^{\beta^{T} x_i}} \right) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - y_i \mathrm{ln}(1+e^{\beta^{T}x_i}) + (1-y_i) \mathrm{ln} \left( \frac{1}{1+e^{\beta^{T}x_i}} \right) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - y_i \mathrm{ln}(1+e^{\beta^{T}x_i}) - (1-y_i) \mathrm{ln} (1 + e^{\beta^{T}x_i} ) \right] \\ &= \sum_{i=1}^{n} \left[ y_i \beta^{T}x_i - \mathrm{ln}(1+e^{\beta^{T}x_i}) \right] \end{align}\]
In order to maximise the log-likelihood, we will use the iteratively reweighted least squares (IRLS) algorithm, which is a Newton-Raphson-like method \[\begin{align} \beta^{(t+1)} = \beta^{(t)} - \textit{H}^{-1}(\beta^{(t)})\, \nabla_{\beta}\, \mathrm{ln}\ \mathcal{L}(\beta^{(t)}) \end{align}\] The idea is to update \(\beta\) over a series of iterations such that each update steps in an ascent direction of the log-likelihood.
To use the IRLS, first we need to calculate the gradient of the log-likelihood \[\begin{align} \nabla_{\beta} \mathrm{ln}\ \mathcal{L}(\beta) &= \sum_{i=1}^{n} \left[ y_i \nabla_\beta \beta^{T}x_i - \nabla_\beta \mathrm{ln}(1+e^{\beta^{T}x_i}) \right] \\ &= \sum_{i=1}^{n} \left[ y_i x_i - \frac{e^{\beta^{T}x_i}}{1+e^{\beta^{T}x_i}} x_i \right] \\ &= \sum_{i=1}^{n} (y_i - \pi_i) x_i \\ &= X^{T}(y - \pi) \end{align}\]
Next, we will calculate the Hessian, \(H\), of the log-likelihood by taking derivatives one more time from the gradient. \[\begin{align} H &= -\sum_{i=1}^{n} \pi_i (1-\pi_i) x_i x_i^{T} \\ &= - X^{T} W X \\ \end{align}\]
where \(W\) is a diagonal weight matrix given by: \[\begin{align} W = diag\{\pi_1 (1-\pi_1), \pi_2 (1-\pi_2), ..., \pi_n (1-\pi_n)\} \end{align}\]
Now that we have the components, we can substitute these back into (5) to expand it out \[\begin{align} \beta^{(t+1)} = \beta^{(t)} + (X^{T}W^{(t)}X)^{-1}X^{T}(y - \pi^{(t)}) \end{align}\]
Step 1: Start from the equation linking outcome to independent variables \[\begin{align} \mathrm{ln}\left(\underbrace{\frac{\mathbb{E}(y_i|x_i)}{1-\mathbb{E}(y_i|x_i)}}_{odds_i}\right) = \beta_0 &+ \beta_1 age_i + \beta_2 gender^{(male)}_i \\ &+ \beta_3 ethnicity^{(malay)}_i + \beta_4 ethnicity^{(indian)}_i + \cdots \end{align}\]
Step 2: Analyse unit changes
Continuous example: 1 unit increase in age, with everything else held constant. The exponent of \(\beta_1\) gives us the odds ratio for every unit increase in age. \[\begin{align} \mathrm{ln}\ odds_i|_{age=a+1} - \mathrm{ln}\ odds_i|_{age=a} &= \beta_1 (a+1) - \beta_1 (a) \\ \mathrm{ln}\ \frac{odds_i|_{age=a+1}}{odds_i|_{age=a}} &= \beta_1 \\ \frac{odds_i|_{age=a+1}}{odds_i|_{age=a}} &= e^{\beta_1} \end{align}\]
Categorical example: compare to reference category, with everything else held constant. The exponent of \(\beta_3\) gives us the odds ratio of malay compared to chinese. \[\begin{align} \mathrm{ln}\ odds_i|_{ethnicity=malay} - \mathrm{ln}\ odds_i|_{ethnicity=chinese} &= \beta_3 \\ \mathrm{ln}\ \frac{odds_i|_{ethnicity=malay}}{odds_i|_{ethnicity=chinese}} &= \beta_3 \\ \frac{odds_i|_{ethnicity=malay}}{odds_i|_{ethnicity=chinese}} &= e^{\beta_3} \end{align}\]
When our outcome \(y_i\), is continuous and positive, we can model \(y_i\) with a Gamma distribution
Component 1: \(y_i\) is Gamma distributed
\[ y_i | X_i \sim Gamma(\alpha, \beta) \]
Component 2: One choice of the link function, \(g\), is the ln function
\[ \begin{align*} g(\mathbb{E}(y_i|X_i)) = \mathrm{ln}\ \mathbb{E}(y_i|X_i) = \beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots \\ \end{align*} \]
Notes:
The inverse of the ln function is the exponent function. Alternatively, we can also express \(\mathbb{E}(y_i|x_i) = e^{\beta_0 + \beta_1 age_i + \beta_2 gender_i + \beta_3 ethnicity_i + \cdots}\)
From this expression, we can see that the inverse of the link function (i.e. exponent) maps \(X_i^{T}\beta\) to \(\mathbb{E}(y_i|X_i)\). And we know that this is a valid function because \(\mathbb{E}(y_i|X_i)=e^{X_i^{T}\beta} > 0\)
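A sketch of a Gamma GLM with log link on the penguins data (the outcome and predictor choices here are illustrative, not from the slides):

```r
library(palmerpenguins)

# body_mass_g is continuous and strictly positive, so a Gamma GLM is a
# plausible choice
gamma_model <- glm(body_mass_g ~ species + flipper_length_mm + sex,
                   data = penguins,
                   family = Gamma(link = "log"))

# Coefficients are on the log scale; exponentiate for multiplicative effects
exp(coef(gamma_model))
```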
We will use the mpcmp R package’s implementation of the CMP\(\mu\)-regression
The random intercept model is given as follows:
\[ \begin{align*} &ln \left(\frac{\pi_{2ij}}{\pi_{1ij}} \right) = \beta_{02} + \beta_{12}x_{ij} + u_{2j} \\ &ln \left(\frac{\pi_{3ij}}{\pi_{1ij}} \right) = \beta_{03} + \beta_{13}x_{ij} + u_{3j} \\ & \pi_{1ij} + \pi_{2ij} + \pi_{3ij} = 1 \end{align*} \]
\[ \begin{bmatrix} u_{2j} \\ u_{3j} \\ \end{bmatrix} \sim N \left( \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} , \begin{pmatrix} \sigma^2_{u2} & \\ \sigma_{u23} & \sigma^2_{u3} \\ \end{pmatrix} \right) \]
Where \(u_{2j}\) and \(u_{3j}\) are the random intercepts for cluster \(j\) in the equations for categories 2 and 3, with the bivariate normal distribution given above

| Protocol component | Observational study that emulates the RCT |
|---|---|
| Eligibility criteria | • Apply the same inclusion/exclusion criteria, assuming that we have the variables needed for inclusion/exclusion in our observational dataset <br>• Eligibility should only be determined based on information available at baseline (cf. immortal time bias) <br>• Excluding patients with missing data (the analogue of dropout/loss to follow-up in a trial) introduces selection bias if dropout is associated with both intervention and outcome <br>• Patients could be eligible at multiple time points. We have a choice of (1) single eligible time: set baseline as the first time point when eligibility is met within a time period, or (2) all eligible times (or a large subset): emulate multiple nested trials |
| Treatment strategies | • Assign patients to the strategy consistent with their baseline data <br>• Ensure treatment arms are consistent, e.g. both arms contain only new initiators of a treatment, as opposed to one arm having patients who were already receiving treatment some time before baseline, which affects the outcome <br>• Positivity: “each individual should have positive probability of receiving each level of exposure for every combination of covariates” (i.e. a patient should not be precluded from a level of exposure to the intervention at the outset) <br>• Consistency: “there cannot be two versions of a treatment that would result in different outcomes” |
| Assignment procedures | • We can only emulate trials without blind assignment because individuals/care providers in the dataset are aware of the received treatments <br>• Patients are not randomly assigned to treatment strategies; we rely on methods to achieve exchangeability at baseline, e.g. propensity score matching, inverse probability weighting, and g-methods to deal with time-varying confounding <br>• Limitation: residual confounding from unmeasured confounders |
| Follow-up period | |
| Outcome | |
| Causal contrasts of interest | • Intention-to-treat effect: we would use patients who were prescribed medication for treatment initiation <br>• Per-protocol effect: we would use patients who adhered to the treatment. Adjustment for post-baseline confounding is needed because post-baseline prognostic factors associated with subsequent adherence to treatment may be affected by prior adherence |
| Analysis plan | • Proper definition of baseline/time zero is important (it affects assessment of eligibility) <br>• The best way to define time zero is the time when an eligible individual initiates a treatment strategy |
What is the purpose of propensity score matching?
What is a propensity score?
How is propensity score estimated?
How is the matching done?
library(MatchIt)
# Set seed for reproducibility
set.seed(1234)
# Set row names as ID column
lalonde <- lalonde %>% rownames_to_column("ID")
# Conduct propensity score matching
match_obj <- matchit(
treat ~ age + educ + race + married + nodegree + re78,
data = lalonde,
method = "nearest", # nearest neighbour matching
distance = "glm", link = "logit", # default is logistic regression
replace = FALSE, # no replacement, each control unit can only be matched to one treated unit
ratio = 1 # 1:1 matching
)
print(match_obj)
A matchit object
- method: 1:1 nearest neighbor matching without replacement
- distance: Propensity score
- estimated with logistic regression
- number of obs.: 614 (original), 370 (matched)
- target estimand: ATT
- covariates: age, educ, race, married, nodegree, re78
# Get matched subjects
matched_ids <- match.data(match_obj)$ID
lalonde_matched <- lalonde %>% filter(ID %in% matched_ids)
tableby(treat ~ age + educ + race + married + nodegree + re78,
data = lalonde,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
digits = 2, numeric.test = "kwt",
total = FALSE, test = FALSE)) %>%
extend_tableby(lalonde, smd = TRUE, smd_digits = 2) %>%
safe_left_join(
tableby(treat ~ age + educ + race + married + nodegree + re78,
data = lalonde_matched,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
digits = 2, numeric.test = "kwt",
total = FALSE, test = FALSE)) %>%
extend_tableby(lalonde_matched, smd = TRUE, smd_digits = 2) %>%
select(-Variable),
by = "marker") %>%
select(-marker) %>%
kable(align = c("l", rep("r", 6))) %>%
column_spec(column = 1, width_min = "4.75cm") %>%
column_spec(column = c(2, 3, 5, 6), width_min = "4.5cm") %>%
column_spec(column = c(4, 7), width_min = "2.5cm") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), fixed_thead = T, full_width = T) %>%
add_header_above(c(" " = 1, "Before matching" = 3, "After matching" = 3))
| Variable | Before matching: 0 (N=429) | Before matching: 1 (N=185) | Before matching: SMD | After matching: 0 (N=185) | After matching: 1 (N=185) | After matching: SMD |
|---|---|---|---|---|---|---|
| **age** | | | 0.24 | | | 0.05 |
| - Mean (SD) | 28.03 (10.79) | 25.82 (7.16) | | 25.40 (10.32) | 25.82 (7.16) | |
| - Median (Q1, Q3) | 25.00 (19.00, 35.00) | 25.00 (20.00, 29.00) | | 20.00 (18.00, 31.00) | 25.00 (20.00, 29.00) | |
| - Range | 16.00 - 55.00 | 17.00 - 48.00 | | 16.00 - 55.00 | 17.00 - 48.00 | |
| **educ** | | | 0.04 | | | < 0.01 |
| - Mean (SD) | 10.24 (2.86) | 10.35 (2.01) | | 10.34 (2.61) | 10.35 (2.01) | |
| - Median (Q1, Q3) | 11.00 (9.00, 12.00) | 11.00 (9.00, 12.00) | | 11.00 (9.00, 12.00) | 11.00 (9.00, 12.00) | |
| - Range | 0.00 - 18.00 | 4.00 - 16.00 | | 0.00 - 18.00 | 4.00 - 16.00 | |
| **race** | | | 1.70 | | | 0.86 |
| - black | 87 (20.3%) | 156 (84.3%) | | 87 (47.0%) | 156 (84.3%) | |
| - hispan | 61 (14.2%) | 11 (5.9%) | | 43 (23.2%) | 11 (5.9%) | |
| - white | 281 (65.5%) | 18 (9.7%) | | 55 (29.7%) | 18 (9.7%) | |
| **married** | | | 0.72 | | | 0.05 |
| - Mean (SD) | 0.51 (0.50) | 0.19 (0.39) | | 0.21 (0.41) | 0.19 (0.39) | |
| - Median (Q1, Q3) | 1.00 (0.00, 1.00) | 0.00 (0.00, 0.00) | | 0.00 (0.00, 0.00) | 0.00 (0.00, 0.00) | |
| - Range | 0.00 - 1.00 | 0.00 - 1.00 | | 0.00 - 1.00 | 0.00 - 1.00 | |
| **nodegree** | | | 0.24 | | | 0.06 |
| - Mean (SD) | 0.60 (0.49) | 0.71 (0.46) | | 0.68 (0.47) | 0.71 (0.46) | |
| - Median (Q1, Q3) | 1.00 (0.00, 1.00) | 1.00 (0.00, 1.00) | | 1.00 (0.00, 1.00) | 1.00 (0.00, 1.00) | |
| - Range | 0.00 - 1.00 | 0.00 - 1.00 | | 0.00 - 1.00 | 0.00 - 1.00 | |
| **re78** | | | 0.08 | | | 0.04 |
| - Mean (SD) | 6984.17 (7294.16) | 6349.14 (7867.40) | | 6633.79 (6888.01) | 6349.14 (7867.40) | |
| - Median (Q1, Q3) | 4975.51 (220.18, 11688.82) | 4232.31 (485.23, 9643.00) | | 5173.52 (461.05, 10122.43) | 4232.31 (485.23, 9643.00) | |
| - Range | 0.00 - 25564.67 | 0.00 - 60307.93 | | 0.00 - 25564.67 | 0.00 - 60307.93 | |
The use of the standardised difference to evaluate covariate balance is preferred over p-values from statistical tests (e.g. t-test, chi-square test)
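To make the standardised mean difference concrete, here is a small helper (my own sketch, not taken from any package) for a continuous covariate, using the usual pooled-variance formula \(|\bar{x}_1 - \bar{x}_0| / \sqrt{(s_1^2 + s_0^2)/2}\):

```r
# Standardised mean difference for a continuous covariate between two groups
smd_cont <- function(x, group) {
  m  <- tapply(x, group, mean)  # group means
  s2 <- tapply(x, group, var)   # group variances
  unname(abs(m[1] - m[2]) / sqrt((s2[1] + s2[2]) / 2))
}

# Toy example: group means 2 and 11, each group has variance 1
x <- c(1, 2, 3, 10, 11, 12)
g <- c(0, 0, 0, 1, 1, 1)
smd_cont(x, g)  # 9
```

Unlike a p-value, this quantity does not depend on the sample size, which is one reason it is preferred for assessing balance after matching.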
Suppose we have data in the following format
From section 1.9 of Enders
- The test does not identify the specific variables that violate MCAR, so it is only useful for testing an omnibus hypothesis that is unlikely to hold in the first place
- The version of the test outlined above assumes that the missing data patterns share a common covariance matrix. MAR and MNAR mechanisms can produce missing data patterns with different variances and covariances, and the test statistic in Equation 1.4 would not necessarily detect covariance-based deviations from MCAR
- Simulation studies suggest that Little’s test suffers from low power, particularly when the number of variables that violate MCAR is small, the relationship between the data and missingness is weak, or the data are MNAR. Consequently, the test has a propensity to produce Type II errors and can lead to a false sense of security about the missing data mechanism
Image from R Markdown cookbook chapter 2.1
Create an rmarkdown document
Knit your document into html
The consort package provides a convenient consort_plot function to plot CONSORT diagrams for reporting inclusion/exclusion/allocation (see the vignette)
library(consort)
flow_diagram <- consort_plot(data = penguins %>%
mutate(id = row_number(),
exclusion = case_when(island == "Dream" ~ "Penguins on Dream island",
year == 2007 ~ "Data collected in 2007",
TRUE ~ NA_character_) %>%
fct_relevel("Penguins on Dream island",
"Data collected in 2007")),
orders = c(id = "Study population",
exclusion = "Excluded",
id = "Data for analysis"),
side_box = "exclusion",
cex = 1.2)
plot(flow_diagram)
When presenting in HTML documents, the plot function sometimes does not render the diagram fully; when that happens, save it as an image first
Next, load your image using knitr::include_graphics
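For example, in a code chunk of your document (the file path here is a placeholder for wherever you saved the image):

```r
# Load a previously saved image into the rendered document;
# "flow_diagram.png" is a hypothetical path - substitute your own
knitr::include_graphics("flow_diagram.png")
```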
pacman::p_load(arsenal, knitr)
tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
digits = 2)) %>%
summary(text = TRUE) %>%
knitr::kable(align = c("l", rep("r", 5)))
The kable function converts the table object to html
| | Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value |
|---|---|---|---|---|---|
| sex | | | | | 0.976 |
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| island | | | | | < 0.001 |
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| bill_length_mm | | | | | < 0.001 |
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.79 (2.66) | 48.83 (3.34) | 47.50 (3.08) | 43.92 (5.46) | |
| - Median (Q1, Q3) | 38.80 (36.75, 40.75) | 49.55 (46.35, 51.08) | 47.30 (45.30, 49.55) | 44.45 (39.23, 48.50) | |
| - Range | 32.10 - 46.00 | 40.90 - 58.00 | 40.90 - 59.60 | 32.10 - 59.60 |
The kableExtra package provides useful functions to format tables e.g. changing column width, changing colors, font size, fixed header rows etc.
pacman::p_load(arsenal, knitr, kableExtra)
tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
digits = 2)) %>%
summary(text = TRUE) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 1, width_min = "4.75cm") %>%
kableExtra::column_spec(column = 2:5, width_min = "5.75cm") %>%
kableExtra::column_spec(column = 6, width_min = "2.25cm")
| | Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value |
|---|---|---|---|---|---|
| sex | | | | | 0.976 |
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| island | | | | | < 0.001 |
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| bill_length_mm | | | | | < 0.001 |
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.79 (2.66) | 48.83 (3.34) | 47.50 (3.08) | 43.92 (5.46) | |
| - Median (Q1, Q3) | 38.80 (36.75, 40.75) | 49.55 (46.35, 51.08) | 47.30 (45.30, 49.55) | 44.45 (39.23, 48.50) | |
| - Range | 32.10 - 46.00 | 40.90 - 58.00 | 40.90 - 59.60 | 32.10 - 59.60 |
If you added labels to your dataframe using the labelled package, they can be displayed in the table via the labelTranslations argument in summary
labels <- list(sex = "Sex",
island = "Island",
bill_length_mm = "Bill length (mm)")
labelled::var_label(penguins) <- labels
tableby(species ~ sex + island + bill_length_mm,
data = penguins,
control = tableby.control(numeric.stats = c("Nmiss", "meansd", "medianq1q3", "range"),
digits = 2)) %>%
summary(text = TRUE, labelTranslations = map(labelled::var_label(penguins), ~ str_glue("**{.x}**"))) %>%
kable(align = c("l", rep("r", 5))) %>%
kableExtra::column_spec(column = 1, width_min = "5.5cm") %>%
kableExtra::column_spec(column = 2:5, width_min = "5.75cm") %>%
kableExtra::column_spec(column = 6, width_min = "2.25cm")
| | Adelie (N=152) | Chinstrap (N=68) | Gentoo (N=124) | Total (N=344) | p value |
|---|---|---|---|---|---|
| **Sex** | | | | | 0.976 |
| - N-Miss | 6 | 0 | 5 | 11 | |
| - female | 73 (50.0%) | 34 (50.0%) | 58 (48.7%) | 165 (49.5%) | |
| - male | 73 (50.0%) | 34 (50.0%) | 61 (51.3%) | 168 (50.5%) | |
| **Island** | | | | | < 0.001 |
| - Biscoe | 44 (28.9%) | 0 (0.0%) | 124 (100.0%) | 168 (48.8%) | |
| - Dream | 56 (36.8%) | 68 (100.0%) | 0 (0.0%) | 124 (36.0%) | |
| - Torgersen | 52 (34.2%) | 0 (0.0%) | 0 (0.0%) | 52 (15.1%) | |
| **Bill length (mm)** | | | | | < 0.001 |
| - N-Miss | 1 | 0 | 1 | 2 | |
| - Mean (SD) | 38.79 (2.66) | 48.83 (3.34) | 47.50 (3.08) | 43.92 (5.46) | |
| - Median (Q1, Q3) | 38.80 (36.75, 40.75) | 49.55 (46.35, 51.08) | 47.30 (45.30, 49.55) | 44.45 (39.23, 48.50) | |
| - Range | 32.10 - 46.00 | 40.90 - 58.00 | 40.90 - 59.60 | 32.10 - 59.60 |